Automatic Generation of Thematically Focused Information Portals from Web Data
Finding the desired information on the Web is often a hard and time-consuming task. This thesis presents a methodology for automatically generating thematically focused portals from Web data. The key component of the proposed Web retrieval framework is a thematically focused Web crawler that is interested only in a specific, typically small, set of topics. The focused crawler uses classification methods to filter fetched documents and to identify the Web sources most likely to be relevant for further downloads. We show that the human effort needed to prepare a focused crawl can be minimized by automatically extending the training dataset with additional training samples, termed archetypes. This thesis introduces the combination of classification results and link-based authority ranking methods for selecting archetypes, together with periodic re-training of the classifier. We also explain the architecture of the focused Web retrieval framework and discuss the results of comprehensive use-case studies and evaluations with the prototype system BINGO!. Furthermore, the thesis addresses aspects of crawl postprocessing, such as refinements of the topic structure and restrictive document filtering. We introduce postprocessing methods and meta methods that are applied in a restrictive manner, i.e. by leaving out some uncertain documents rather than assigning them to inappropriate topics or clusters with low confidence. We also introduce a methodology of collaborative crawl postprocessing for multiple cooperating users in a distributed environment, such as a peer-to-peer overlay network. An important aspect of a thematically focused Web portal is the ranking of search results. This thesis addresses search personalization by aggregating explicit or implicit feedback from multiple users and capturing topic-specific search patterns in profiles.
Furthermore, we consider advanced link-based authority ranking algorithms that exploit crawl-specific information, such as classification confidence grades for particular documents. This is achieved by weighting edges in the link graph of the crawl and by adding virtual links between highly relevant documents of the topic. The results of our systematic evaluation on multiple reference collections and real Web data show the viability of the proposed methodology.
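To make the idea of confidence-weighted link analysis concrete, the following is a minimal sketch and not the algorithm used in the thesis. It assumes a PageRank-style iteration in which each outgoing link is weighted by the classifier confidence of its target document. The function name `weighted_pagerank`, the parameters `damping` and `iterations`, and the toy graph are all hypothetical. The virtual links between highly relevant documents mentioned in the abstract are not modelled here.

```python
def weighted_pagerank(links, confidence, damping=0.85, iterations=50):
    """PageRank-style ranking with confidence-weighted edges (illustrative only).

    links:      dict mapping a document id to a list of documents it links to
    confidence: dict mapping every document id to a classifier confidence in [0, 1]
    Assumption: the weight of an edge u -> v is confidence[v].
    """
    nodes = sorted(confidence)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}

    # Total outgoing weight per node, used to normalise edge weights.
    out_weight = {u: sum(confidence[v] for v in links.get(u, ())) for u in nodes}

    for _ in range(iterations):
        # Every node receives the uniform "random jump" share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            targets = links.get(u, ())
            if not targets or out_weight[u] == 0:
                # Dangling node (or all targets have zero confidence):
                # spread its rank uniformly over all nodes.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
                continue
            for v in targets:
                # Pass rank along the edge in proportion to its weight.
                new_rank[v] += damping * rank[u] * confidence[v] / out_weight[u]
        rank = new_rank
    return rank


# Toy crawl: "a" links to "b" and "c", "b" links to "c", "c" links back to "a".
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
confidence = {"a": 0.9, "b": 0.2, "c": 0.8}
ranks = weighted_pagerank(links, confidence)
```

In this toy graph the low-confidence document "b" receives only a small fraction of the rank flowing out of "a", so it ends up ranked below "c". With all confidences equal, the iteration reduces to ordinary PageRank.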
Explicit versus Latent Concept Models for Cross-Language Information Retrieval
Cimiano P, Schultz A, Sizov S, Sorg P, Staab S. Explicit versus Latent Concept Models for Cross-Language Information Retrieval. In: Boutilier C, ed. IJCAI 2009, Proceedings of the 21st International Joint Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press; 2009: 1513-1518
Determining citizens’ opinions about stories in the news media: analysing Google, Facebook and Twitter
We describe a method whereby a governmental policy maker can discover citizens’ reactions to news stories. This is particularly relevant in the political world, where governments’ policy statements are reported by the news media and discussed by citizens. The work here addresses two main questions: where are citizens discussing a news story, and what are they saying? Our strategy to answer the first question is to find news articles pertaining to the policy statements, then perform internet searches for references to the news articles’ headlines and URLs. We have created a software tool that schedules repeating Google searches for the news articles and collects the results in a database, enabling the user to aggregate and analyse them to produce ranked tables of sites that reference the news articles. Using data mining techniques, we can analyse the data so that the resulting ranking reflects an overall aggregate score that takes multiple datasets into account, and this shows the most relevant places on the internet where the story is discussed. To answer the second question, we introduce the WeGov toolbox as a tool for analysing citizens’ comments and behaviour pertaining to news stories. We first use the tool to identify social network discussions, using different strategies for Facebook and Twitter. We then apply different analysis components to distil the essence of the social network users’ comments, to determine influential users and to identify important comments.